
Poster: Richard BBC Archives Date: May 11, 2004 9:01pm
Forum: petabox Subject: Selective powering of large petasites

I don't want to lower the tone here with my lack of knowledge, but media archivists are desperately interested in hard drives vs tape for the future of our large shelf-based holdings.
The BBC has 85 km of shelves, which translates very roughly (digitised at 25 Mb/s) to 200 TB/km => 17 PB. This is an overestimate for us, because not all our shelves hold video, and we have spare copies and VHS 'browse' copies. But it gives a round number: 10 PB for the BBC archive, and similar sizes for other major European broadcast archives.
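As a rough check of the arithmetic above, here is a minimal Python sketch. The shelf length, the TB/km figure, the 10 PB round number and the 20% yearly access rate come from this post; the function and variable names are only illustrative.

```python
# Back-of-envelope sizing using the figures quoted in the post.
SHELF_KM = 85          # shelf length
TB_PER_KM = 200        # very rough, for video digitised at 25 Mb/s
ROUND_NUMBER_PB = 10   # working figure after discounting non-video and spare copies
ACCESSED_PER_YEAR = 0.20

def shelf_estimate_pb(shelf_km, tb_per_km):
    """Upper-bound archive size in PB (taking 1 PB = 1000 TB)."""
    return shelf_km * tb_per_km / 1000

upper_bound_pb = shelf_estimate_pb(SHELF_KM, TB_PER_KM)
accessed_pb_per_year = ROUND_NUMBER_PB * ACCESSED_PER_YEAR

print(f"upper bound: {upper_bound_pb} PB")
print(f"accessed per year: {accessed_pb_per_year} PB")
```

On these numbers the upper bound is 17 PB and roughly 2 PB of the 10 PB working figure would be touched in a year.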
Our access to this material is selective: about 20% is accessed per year. Is it reasonable to have a 10 PB mass storage system with the majority (like 99% or more) of the drives switched off, applying power only when material on that drive is needed?
Archivists (without the technical knowledge of you lot) are already asking whether we should put hard drives on our shelves, simply because of their low cost.
Is there any basic flaw in designing an archive storage system with "selective" power, to vastly reduce power/cooling requirements?


Poster: illtud Date: May 11, 2004 10:03pm
Forum: petabox Subject: Re: Selective powering of large petasites

Richard,

I would suggest that your needs would best be served by a tape library, or probably an HSM (hierarchical storage management) solution where a portion of the tape library's content is cached on disk. Presumably a few minutes' wait would not be a problem whilst accessing full-quality bitstreams. Tape libraries can also give you automated tape duplication for offsite storage (disaster recovery) and media refreshing (digital preservation). They also give far fewer problems with regard to power and climate issues (lots of disks equal lots of heat).

Here at the National Library of Wales we have only a smallish (tens of terabytes) tape library, but if cost is not an issue, Hitachi or Sony will gladly sell you larger solutions. Your main headache will be the development of the management and cataloguing side.


Poster: brewster Date: May 11, 2004 11:12pm
Forum: petabox Subject: Re: Selective powering of large petasites

Richard and "illtud"--

Thank you for the notes. We have had some experience with both tapes and hard drives at the Internet Archive and Television Archive, all of which points to keeping multiple copies and keeping them as active as possible.

The Television Archive, whose holdings are closing in on a petabyte, started on tape and now records onto hard drives that are kept offline. We don't have much experience reading this collection back, apart from the material from Sept 11 to Sept 18, 2001, and that all worked fine.

A bigger tape experiment was trying to read 1000 DLT tapes recorded by the Internet Archive from 1996-1999. Faults made some tapes difficult to read, and a limited amount of data was lost. Reading was also very slow (it took months of an administrator's time). Since then, all data is recorded onto hard drives that are kept online.

Spinning disks seem to have a failure rate of 6% per year, but we are working on better measurements. When a disk "fails" it does not always lose data, or sometimes loses only one block, so recovery can be effective. But this means we should not rely on a single copy.

Our data protection system is to have at least 2 copies and preferably in distant locations (we have found that human error accounts for real loss as well, so having different administrative bodies helps). We keep copies in San Francisco and at the Library of Alexandria in Egypt.
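To put the 6% figure and the two-copy rule into numbers, here is a minimal sketch in Python. Only the 6% annual failure rate comes from this post. The drive count, the one-week re-copy window, the assumption that failures are independent and spread evenly over the year, and the function names are all illustrative assumptions, not measurements.

```python
ANNUAL_FAILURE_RATE = 0.06  # quoted above; better measurements are still pending

def expected_failures_per_year(n_drives, afr=ANNUAL_FAILURE_RATE):
    """Average number of drives expected to fail in a year."""
    return n_drives * afr

def mirror_fails_during_repair(afr=ANNUAL_FAILURE_RATE, repair_days=7):
    """Rough chance that the second copy's drive fails while the first
    is being re-copied (assumes independent, evenly spread failures)."""
    return afr * repair_days / 365

# Hypothetical fleet of 1000 drives with a one-week re-copy window.
print(expected_failures_per_year(1000))
print(mirror_fails_during_repair(repair_days=7))
```

Under these assumptions a single copy loses a drive every few days at scale, while the chance of the second copy failing inside a one-week repair window is on the order of 0.1%. Human error and correlated failures are not modelled, which is part of the argument for distant locations and separate administrative bodies.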

We are developing the petabox for exactly this reason. It is designed from the bottom up for reliability, low power, and low cost. The low cost means that we can have 2 or more copies of even large datasets.

I am in Europe for the next 2 months setting up a European Internet Archive that will host those machines in Amsterdam. I would be very interested in talking with anyone about what we are doing if this is of interest.

I can be reached directly at brewster (at) archive.org

-brewster
Digital Librarian


Poster: JTW Date: May 12, 2004 4:42pm
Forum: petabox Subject: Re: Selective powering of large petasites

Like you’ve been saying, once you go beyond a couple of terabytes of data, most of it is rarely accessed again, or in some cases never again. This is one of the problems I see with very large databases. We have massive servers running 24/7 in which only about 1% of the stored data is used, and 97% of that data was added in the last 3-6 months. But because of the database software we’re running (and a management decision), all the data is stored in one large database that is always powered on. But to the part that might be of interest to you: we also store reports long-term.

What I’ve been looking into is a system with “on demand wake up” functionality for these reports. The “computers”, and more importantly the hard drives, spend most of their time turned off in a sleep mode, i.e. actually off, with a network signal to the BIOS bringing them back to life when required. For large archives this could save thousands in power consumption, reduce heat problems, and should cut down hard drive failure rates. From a management point of view, if it’s possible to figure out in advance what is going to be accessed least, that data can be placed in this long-term storage system, while the more highly demanded data stays in an always-on subsystem.

From a topology point of view everything seems to be online 24/7, but in actuality it’s the requests for data that drive which systems are currently powered up. I’m on the prowl to see if anybody else is doing this before I invest time in creating our own solution for feasibility testing with off-the-shelf components and Linux. Initially it would have 4 boxes with a single 100GB drive each (400GB of data in total) and a master controller. This controller will mount all the subsystems with NFS or Samba, depending on which OS works best for hibernation / suspend modes. The idea is that the other computers power up only when someone accesses data in those subdirectories on the master controller. Of course there needs to be some controlling program that knows the location of all the data you have, meaning you can’t just let users start browsing the network looking for files, as the systems will end up starting and stopping every couple of minutes.
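The "network signal to the BIOS" described above is commonly done with Wake-on-LAN, where the sleeping machine's network card listens for a "magic packet": six 0xFF bytes followed by the target MAC address repeated 16 times, usually sent as a UDP broadcast. A minimal Python sketch of the controller side follows. The catalogue contents, MAC addresses, directory names and function names are hypothetical, and this is not a description of anyone's actual setup.

```python
import socket

# Hypothetical catalogue: which storage box (by MAC) holds which subdirectory.
CATALOGUE = {
    "reports/2001": "00:11:22:33:44:55",
    "reports/2002": "00:11:22:33:44:66",
}

def build_magic_packet(mac: str) -> bytes:
    """Wake-on-LAN payload: 6 bytes of 0xFF, then the MAC repeated 16 times."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return b"\xff" * 6 + bytes.fromhex(hex_digits) * 16

def wake(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet on the local network."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))

def wake_box_for(path: str) -> str:
    """Look the path up in the catalogue and wake only the box that holds it."""
    for prefix, mac in CATALOGUE.items():
        if path.startswith(prefix):
            wake(mac)
            return mac
    raise KeyError(f"no box in the catalogue holds {path!r}")
```

Routing every request through a lookup like `wake_box_for` is one way to keep casual browsing from waking boxes at random, since only a path the catalogue knows about triggers a power-up. The target machine must have Wake-on-LAN enabled in its BIOS and network card for the packet to have any effect.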


Poster: brewster Date: May 12, 2004 5:32pm
Forum: petabox Subject: Re: Selective powering of large petasites

The Library of Alexandria has a copy of much of the web collection of the Internet Archive. They run their systems with a setting that puts disks to sleep after 3 minutes of inactivity. They report it works fine. In a separate test, Bruce Baumgart found that it takes 9-10 seconds to spin a disk back up.
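To get a feel for what a 3-minute idle timeout means for one disk, here is a minimal Python simulation sketch. The 180-second timeout and the 10-second spin-up (the upper end of the 9-10 second measurement) come from this post; the access pattern, the one-hour horizon and the function name are made-up illustrations.

```python
IDLE_TIMEOUT_S = 180  # sleep after 3 minutes of inactivity
SPIN_UP_S = 10        # upper end of the 9-10 second measurement

def simulate_disk(access_times, horizon_s, idle_timeout_s=IDLE_TIMEOUT_S):
    """Return (seconds spent spinning, number of spin-ups) for one disk.

    access_times are seconds from the start, all before horizon_s.
    The disk starts asleep; each access restarts the idle timer.
    """
    spinning_s = 0.0
    spin_ups = 0
    on_since = None
    off_at = None
    for t in sorted(access_times):
        if off_at is None or t >= off_at:
            if off_at is not None:
                spinning_s += off_at - on_since  # close the previous awake period
            spin_ups += 1
            on_since = t
        off_at = t + idle_timeout_s
    if off_at is not None:
        spinning_s += min(off_at, horizon_s) - on_since
    return spinning_s, spin_ups

# Hypothetical hour with three requests, two of them close together.
spinning_s, spin_ups = simulate_disk([0, 60, 1000], horizon_s=3600)
print(spinning_s / 3600, spin_ups, spin_ups * SPIN_UP_S)
```

In this made-up hour the disk spins for 420 of 3600 seconds and wakes twice, so the total added wait is about 20 seconds. Real savings depend entirely on how requests are distributed across disks.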

We have not done a large-scale test of this approach, but it sounds promising for many applications.

A petabox with its disks spun down would use about half the power.

-brewster


Poster: Rob TNA Date: May 12, 2004 11:07pm
Forum: petabox Subject: Re: Selective powering of large petasites

Richard

From the sound of things, you might also want to take a look at the new MAID (Massive Array of Idle Disks) devices appearing on the market. The basic idea seems to be that you have a cabinet of 900 250GB SATA disks, where only 25% of the drives are powered on and spinning at any one time.
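For scale, here is a minimal Python sketch of what such a cabinet would hold. The 900 drives, 250GB per drive and 25% duty figure come from this post, and the 10 PB target is Richard's round number from earlier in the thread. The calculation is raw capacity only and assumes no space is set aside for redundancy or spares; the variable names are illustrative.

```python
import math

DRIVES_PER_CABINET = 900
DRIVE_GB = 250
MAX_SPINNING_FRACTION = 0.25
ARCHIVE_PB = 10  # round number from the first post

# Raw capacity per cabinet, taking 1 TB = 1000 GB.
cabinet_tb = DRIVES_PER_CABINET * DRIVE_GB / 1000
# Drives powered on at any one time in a single cabinet.
spinning_drives = int(DRIVES_PER_CABINET * MAX_SPINNING_FRACTION)
# Cabinets needed for one raw copy, taking 1 PB = 1000 TB.
cabinets_needed = math.ceil(ARCHIVE_PB * 1000 / cabinet_tb)

print(cabinet_tb, spinning_drives, cabinets_needed)
```

That works out to 225 TB per cabinet with 225 drives spinning at once, and about 45 cabinets for a single raw copy of 10 PB.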


Poster: Richard BBC Archives Date: May 12, 2004 11:34pm
Forum: petabox Subject: Re: Selective powering of large petasites

Rob TNA, Illtud, Brewster and JTW

Thanks for telling me about MAID. Another reader contacted me offline as well -- so it isn't a daft idea, and at least one company, COPAN, is promoting it commercially:
http://www-conf.slac.stanford.edu/dmw2004/slacworkshop/talks/guha/DMF2000-CopanSystems.pdf

Brewster, I'll email you directly about your European trip -- thank you very much for the offer.
-Richard BBC
